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Abstract 

Frcqucntist and likelihood methods of inference based on the multivariate skew- 
normal model encounter several technical difficulties with this model. In spite 
of the popularity of this class of densities, there are no broadly satisfactory so- 
lutions for estimation and testing problems. A general population Monte Carlo 
algorithm is proposed which: 1) exploits the latent structure stochastic repre- 
sentation of skew-normal random variables to provide a full Bayesian analysis 
of the model and 2) accounts for the presence of constraints in the parameter 
space. The proposed approach can be defined as weakly informative, since the 
prior distribution approximates the actual reference prior for the shape param- 
eter vector. Results are compared with the existing classical solutions and the 
practical implementation of the algorithm is illustrated via a simulation study 
and a real data example. A generalization to the matrix variate regression model 
with skew-normal error is also presented. 

Keywords: Bayes factor. Matrix variate regression. Objective Bayes inference, 
Population Monte Carlo, Reference prior, Skewness 



1. Introduction 

The skew-normal (SN hereafter) class of densities has ind ependent l y and 
recu rrently appeared in statistica l literature: see for example Roberts! ( 19661 ) 
and O'Hagan and Leonard! ( 19761 ): it was named by Azzalini (Il985) and fur- 
ther generalised to the multiva riate case by lAzzalini and Dalla Valle ( 1996( ) and 
Azzalini and Capitani 3 (Il999l) . The appearance of the multivariate version is 
to be considered the starting point of a dramatically prolific line of research, 
both from a methodological and an applied perspective. Comprehensive ac- 
counts of the huge production of papers and applications related to the SN 
model and its ramifications can be found, for example, in the book edited by 
Genton ( 2004 ). or in the review paper bv lAzzalini ( 2005 ). The popularity of 
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this class of distributions steins mainly from its ability to capture and explicitly 
model mild departures from symmetry, without losing mathematical tractabil- 
ity, which can be particularly useful in real data applications. Another reason 
for the popularity of the SN class is because it naturally arises in real data 
analysis under special mec hanisms of data coll e ction , such as hidden truncation 
or selective reporting: see [Arnold and Beaver (I2OO2I) . A deeper analysis of the 
literature, however, reveals that most of the existing results are restricted to the 
distributional theory of skew-normal and, more generally, skew-elliptical distri- 
butions. On the o ther hand, the theory of infer ence is still problematic even 



in the sc alar case (lAzzalini and Capitanid . 120031) . These problems were antic 



jap 

ipated in lAzzalinil (|1985[ ) and iLiseol (|1990[ ). and are basically due to a number 
of anomalies of the likelihood function: for instance, under the scalar stndard 
skew-normal model, there is a positive sampling probability that the maximum 
likelihood estimator will produce infinite values; specifically, this phenomenon 
occurs when all the data points have the same sign. These difficulties tend to be 
more challenging in the multivariate set-up where, in addition, "problematic" 
situations are not so easy to detect. Even ignoring these pathological cases, the 
likelihood surface arising from an i.i.d. sample of skew-normal random variables 
is often non regular and maximum likelihood estimates (MLE, hereafter) tend 
to be unstable. 

In this paper we describe a full Bayesian analysis of the multivariate SN 
model. In particular we propose: 

• to use objective priors, in order to correct the odd behavior of the likeli- 
hood function without introducing external information; 

• to exploit the latent structure of the SN model in order to tailor a specific 
version of a Population MonteCarlo (PMC, hereafter) algorithm, and to 
produce valid posterior inferences, in terms of estimation and testing. 

The paper is organized as follows: Section [2] introduces the multivariate SN 
model and presents a few examples that motivates the proposal of the paper. 
Section 13.11 introduces an augmented likelihood function which exploits the in- 
trinsic latent structure of the skew-normal model. In Section we discuss the 
choice of prior distributions; in Section |4] we describe a PMC algorithm with 
proposal dens i ties b ased on the full conditional distributions of the parameters 



(jCeleux et al.l . 120061 ): in Section [5] we discuss the testing and model selection 



problems, where a comparison between the nested normal and the skew-normal 
model may be of interest. Section |5] generalises the approach to the matrix vari- 
ate regression model, which is useful when a set of covariates is available. We 
also discuss some technical and practical issues related to the algorithm. Finally, 
Section[7]deals with some numeric comparisons with other existing methods and 
the analysis of a financial data set. 
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2. Motivations 



A random vector X is said to Iravc a p-dimensional standard SN distribution, 
with correlation matrix $1 and shape parameter a when its density function is 

/ (x; O, a) = 2(^p (x; O) *i [a'x] , x,a£W, (1) 

with (^p(w,A) denoting the density of a p-dimensional normal random vec- 
tor with standard marginals and covariance matrix A, evaluated at w S W, 
and $i(w) is the cumulative distribution function of a standard scalar normal 
random variable. Note that is a correlation matrix, although it is not the 
correlation matrix for the components of X; it even appears in the standard 
version of the SN model. It is easy to generalise the model with the inclusion of 
location and scale parameters. Let ^ be a p-dimensional vector of real numbers 
and 

o) = diag (wi, . . ■,ujp) 

be a diagonal matrix with the marginal scale parameters, so that S = ijjQijJ rep- 
resents the scale matrix; then Y = ^ + u;X has a p-dimensional SN distribution 
(SNpCEj^jOt), hereafter) with density 

/(y; S, a) = 2^p(x - ^ ; [a'uj-' (x - |)] . 

In this parameterization, each component of th e shape parameter a can tak e 
any real value. An alternative parameterization ( Azzalini and Capitaniol 1999I ). 
defined in terms of (5, exists, namely 

a = {i-s'n-^s)-^n~^5, (2) 

or equivalently, 

S = {1 + a'ftayifla. (3) 

Notice that, although each component Sj takes values in [—1,1], the entire 
vector 5 belongs to an ellipsoidal subset of [—1, 1]^ whose shape is regulated by 
n. Although this problem is crucial in any simulation based Bayesian approach 
for inference, it seems to have been neglected in the literature; we will return to 
this issue below. Another possible parameterization , which is particularly useful 



for lik elihood-based inference, has been proposed in I Arellano- Valle and Azzalini 



( 2008 ) 



Consider now the simplest inferential situation, where one observes an i.i.d. 
sample y = of n observations from an SNp{H,^,a) population. 

The likelihood function is then 

L(S,C,a;y) (X |Srt exp|-ifj[(y.-0'S-i(y.-0]| 

n 



X 

i=l 
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This likelihood function is quite difficult to manage ( Azzalini and Capitaniol . 



19991) : there are no closed form expressions for the maximum likelihood esti- 



mator and, as anticipated, the MLE of a. can be infinite even in very simple 
settings. Consider, for example, the case p = 2, when all the parameters except 
OL are known: suppose we observe the following bivariate random sample of size 
10; the first (second) row indicates Xi {X2) values: 

"-0.272 0.340 0.498 1.511 -0.134 0.170 -0.169 0.484 -1.042 0.945 
1.421 0.668 1.610 -0.610 0.577 -0.168 2.222 -0.606 1.789 0.361 




Figure 1: An example of an ill-behaved likelihood function. 



Figure [2] depicts the contour plot of the likelihood function for a. = {ai,a2); it 
is clear that the MLE of the vector cx is infinite: the R function msn.mle in the 
suite sn provides the estimates 

(ai,d2) = (24776144,19911143). 

The unsatisfactory behaviour of the maximur n likel i hood method is not imme- 
diate clear from the sample values. Table 3 in lElind (I2OI2I) shows an even more 
dramatic example with real data in the context of the skew-t model. 

To emphasize this point, we have generated 2000 samples of size 30 from a 
SN2 density with | = (0, 0), Yj = I2 and a ~ (2, 2). Point estimates of the shape 
vector have been obtained, based on the R suite sn, which can be considered as 
a benchmark in this context. Out of 2000 samples, about 38% resulted in an 
infinite estimate for a; Figure [2] shows the subset of the finite point estimates 
for a. Of course, this admittedly unsatisfactory behaviour tends to be even 
worse for smaller sample sizes and/or for larger values of a. While in the scalar 
case the set of samples producing infinite ML estim ates of a (or | 5 |= 1) can be 
exactly characterized ( Liseo and Loperfidol 2006h . the detection of such cases 
in the multivariate case is more complicated. 
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Figure 2; Sampling distribution of the (finite-valued) maximum likelihood estimates of a. 
Solid lines indicate the true value. 



A theoretical justification for tlie unsatisfactory behaviour of the maximum 
hkeUhood estimates is that the symmetrized Kullback-Leibler divergence be- 
tween two SNp densities with similar values of a tends to be very small; this 
fact typically produces a profile likelihood for oc which is rather flat over a large 
portion of the parameter space. Another way of interpreting the difficulties of 
a likelihood approach, at least in a simple setting, is the following. For a fixed 
positive value z, consider, as a function of a, the likelihood ratio between a 
standard normal density and an SN{0^ 1,Q;) one, with positive a, that is 



Since LR{a) is decreasing, for any fixed positive z, in (0, oo), its possible values 
range from 0.5 (when a — >■ cxj, that is for a half- normal density) to 1 (for 
a = 0); in other words the ability of the likelihood to discriminate between a 
normal and a skew-normal model seems quite limited. One possibility is then to 
switch to the production of valid interval estimates. However, solid classical and 
likelihood theories of confidence intervals for the SNp model are still lacking. 
Another technical inferential problem with the SN model is that the likelihood 
function may be multimodal when both the location and the shape parameters 
are unknown: we will discuss this issue below in this section. 

For all these reasons, we propose a full Bayesian analysis of the multivariate 
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SN mode l. A Bayesian analysis base d on objective priors has alrea dy been pro- 
posed by iLiseo and Loperfido ( 2006 ) for the scalar case. See also Wiper et al. 



posea by i-Liseo ana i^opernaoi ii^uuta i tor tne scalar case, bee also i wiper et ai. 
(|2008|) for an objectiv e Bayesian analysis in the half-normal and half-t cas e s, and 



Branco et al. ( 2013af ) for the skew-t case. iFriihwirth-Schnatter and" Pvnd (|2010[ ) 



have recently proposed a fully Bayesian analysis of a mixture of skew-normal 
and skew-t densities. Other recent and important ad vances in the app l icatio n 
of multivariate skew-normal models can be found in iFung and Senetal 1I2OIOI) 
Panagiotelis and SmithI (l2010l).lFerr az and Moural (|2012|) andlCabral et a.1.1 (|2012|) . 



The computational approach described in iFriihwirth-Schnatter and Pvnd (12010^ 
differs from ours in two respects: i) they adopt conjugate priors in order to fa- 
cilitate a Gibbs sampling strategy for simulating from the posterior; ii) as a 
consequence of i), we adopt a different sampling strategy, based on importance 
sampling rather than MCMC; we will describe the PMC algorithm in detail in 
Section U) For the moment we explain why we are not completely confident 
with the use of Gibbs-type algorithms for skew-normal or skew-t models. It is 
a well-known fact that likelihood functions arising from a skew-normal model 
may be multimodal. In these situations, the Gibbs sampler chains are often 
captured by one of the modes. As a consequence, the chains do not mix well 
and the posterior distribution is not well explored. 

As a practical illustration of the problem. Figure [3] pr esents the 1000 draws 

obtained from a Gibbs sampler - similar to that proposed in Friihwirth-Schnatter and Pvnd 

- in the very simple setting of a scalar skew-normal model with unknown 
location ^ and shape a, and a known scale parameter uj = 1. Almost all poste- 
rior draws belong to the same mode and the posterior distribution is not well 
explored. In the multidimensional case, things tend to be more complicated; as 
we will argue in Section [SUl constraints in the parameter space should be intro- 
duced in order to obtain a positive definite correlation matrix, and accounting 
for them in the Gibbs sampling algorithm may not be easy. 



3. Augmented likelihood function and priors 

3.1. Introducing the latent structure 

In this section we describe how to exploit the intrinsically latent structure of 
the SN density function in order to produce an augmented likelihood function. 
The main proposition follows. 

Proposition 1. Let fl be a correlation matrix, S a p- dimensional vector and 
a = (1 - 8^Q."'^5)~^Vl''^5. Define 



p+i 



1 5' 

8 n 



and U 




Then, (a) the random vector Y = 0^11 + ^ ~ S'A^p(S, ^, a), with S = cliJIo;, 
and (b) the joint density of (Y , Z) is given by 



fp+i (y, z) = fp{y I z)f{z) ^Np{^ + ujS\z\,ij{n- SS')u:) x iVi(0, 1) 
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Figure 3: Scalar skew-normal example; posterior distribution for ip) - where i/i = 
Lja/^/l + and 1000 values drawn from a Gibbs algorithm. 

Proof: (a): From one of the possible definitions of a multivariate SN r.v., it is 
known that U ~ SNp{ft,0,a); Y is a simple linear transformation of U and 
its distribution is readily obtained. 

(b): Start from f{y, z) = f{z)f(jj \ z). Then f(z) is, by assumption, a standard 
Gaussian density, while 



Then, by using simple results on conditional Gaussian densities, one gets 



Using the above proposition one can write an augmented likelihood function, 
"as if" we had observed, for each sample unit, the latent value Zi, i = 1, . . . ,n. 
Write — u)d and a;(r2 — dS')u) = S — tptp' = G; define the parameter vector 
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as 6^ = (<5, S, ^) or 6 = (i/;, G, The augmented likelihood function is then 

n 

i(6>;y,z) cx Y[{(pp{y^ - ^ - rp \ z, l;!: - ipip') x ipi_{z,;l)} 
1=1 

Notice that the matrix G must be positive definite; this imphes a logical con- 
straint among the values of d and Jl in the original parameterization which 
should be taken into account when exploring the parameter space via simulation 
methods. As wc have already noticed, this issue seems to have been neglected 
in literature. See Azzalini's website http://azzalini.stat.unipd.it/SN/, 
under the section "A less frequent question" for a graphical treatment of this 
problem. In particular, MCMC methods should be used with care in order to 
avoid the chain in the (S, Jl) parameterization visiting inadmissible parts of the 
parameter space. 



3.2. Prior distributions 

Our primary goal here is to propose a general method of inference for the 
parameters of the multivariate SN distribution. For these reasons wc have tried 
to be as "objective" as possible in choosing the prior for the parameter vector. 
However, it is not easy to derive a formal Jeffreys or reference prior for the 
parameters of a multivariate skew-normal distribution. In this paper wc have 
assumed a priori, as usual, ^ i ((5, S). Also wc have assumed a flat prior for 
the "location" parameter ^ and a conjugate normal Inverse Wishart prior for 
the "scale" parameter S, that is 

7r(|) ocl and S - IWp{m, A). 

Obviously, one can always consider the limiting case (m 0, A -> 0) to get the 
classical Jeffreys prior 

^(^>S)(x— (4) 



The ch oice of a good objective prior for 5 (or a) is more delicate. iLiseo and Loperfido 



( 2006f ) have shown that, in the univariate SN model, the Jeffreys' prior for the 



shape parameter a is proper; its use, in a sense, automatically and pragmat- 
ically solves the problem of a potentially non vanishing likelihood function , 
which can happen with the skew-no rmal model ( Azzalini and Capitanio . 19991 ). 



iBranco and Baves Rodrigued ( 2007 ) have shown that the Jeffreys' prior can be 



adequately approximated by a Student t density with a half deg r ee of fr eedom 
centered at zero and with scale parameter 7r^/4. IBranco et al ] (l2013bl) . m an 



as yet unpublished technical report, have partially generalised the above results 
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to the bivariate case, but no general results are available for the SNp model 
with p > 2. They have proved that, unlike the scalar case, the Jeffreys' prior 
is improper in the bivariate case. On the other hand the one-at-a-time refer- 



ence prior ([Berger and Bernardol Il992[ ) is proper although its expression is quite 
complicated. In particular, the Jeffrey s' prior in the a parameteriza t ion (u sing 
the same approximation provided bv iBranco and Baves Rodriguez ( 2007 ) for 



the scalar case) is 



TT (ai,a2 OC- „ — ^ 

^ ' ' l + 7r2/2(af + 



The proper reference prior when ai is the parameter of interest and a2 is con- 
sidered a nuisance parameter is given by 7r^(a2 | ai)TT^ {ai) where 



(Q!2 Ctl) OC — — = (5) 



and 



TT^iai) (X cxp (^-^J log (1 + 2r/^(a? + t^)) Trait \ ai)d?j 



(6) 



where rj ~ -k 12. Of course, when 0,2 is the parameter of interest the prior is the 
same with a\ and 0,2 switching their roles. The above considerations show that 
an objective analysis can be made only for one component of the shape vector: 
to get a proper posterior with sampling probability 1, in the multivariate case, 
one should introduce genuine prior information for some of the components of a. 
For practical purposes, a prior can be chosen in the following way: in the scalar 
case the approximate Jeffreys' prior for /? = (1 -I- (5)/2, with 8 — a/ ^/l + c? , is 
a Beta(0.25, 0.25) prior; in analogy with that, one can use, in the multivariate 
case, the prior 

n^^^{8)^\[{l-5])-\ (7) 

that is, wc assume that the components of the skewness vector are, a priori, 
independent and identically distributed. Although independence can be con- 
sidered a strong assumption, it is hard to conceive any non subjective form of 
dependence. Alternatively, the use of a uniform prior in the 5 parameterization, 
especially for p > 2, could be suggested. Using the Jacobian 



85 



da 

the uniform prior in the 5 parameterization is transformed into 



^^(a) (X (8) 

which explicitly introduces a dependence on the correlation matrix fJ. Notice 
also that any prior distribution on S should be considered only for those values 
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Figure 4: Left panel: different shapes of the parameter space of S for different values of p. 
Right panel: dots represent numerical evaluations of A{ii) for different values of p, the solid 
curve represents the approximation given in cq. l llOII . Vertical lines denote the same values of 
p considered in the left panel. 

which satisiy the constraints iUustrated at the end of section 13.11 In this per- 
spective, for computational reasons, we wiU consider the parameter constraint 
as generated from the prior rather than from the hkehhood. In the rest of the 
paper, we will then consider, as the prior for 6, 

where A(r2) is the integral of ([7]) over the parameter values such that det(G) > 0. 
In fact, it is important to notice that, given the hierarchical structure of the prior 
for 5, one needs to recover the normalizing constant A(Cl). This is analitically 
feasible only for particular choices oi tt{6 \ fl). In all other cases, one needs to 
evaluate the integral oi tt{5 \ H) on the ellipsoid determined by the constraint. 

We now give some details about the evaluation of A{n) when p = 2. Gen- 
eralizations to higher dimensions arc similar, although more complicated. In a 
bivariate setup, we define p to be the off-diagonal element of $7; the constraint 
produces an ellipse which is a proper subset of the square [—1,1]^ (see figure 
131 left panel): the shape of the ellipse is a function of p. A{Cl) can be evalu- 
ated on a grid of values of p, for example using a rejection sampler where the 
simulated values of S are generated by two independent Beta(0.1, 0.1) deviates. 
The choice of the proposal is due to the fact that the Jeffreys' prior puts most 
of the probability mass on the boundary of the square. Results on a grid of 
values of p are represented in the right panel of fig. U by dots. This approach 
can be computationally demanding. For practical purposes, a very satisfactory 



10 



approximation can be obtained using the formula 

A{fl) ^ a{l^ p^. (10) 

Estimation of a and b is then straightforward. We have obtained a = 6.68 and 
b = 0.28. 

Although the above defined 6* parameterization is more suitable for elicita- 
tion, the alternative 9 parameterization should be preferred in terms of imple- 
mentation and computation. From now on, we will use 9 = {i/j, G, ^). This can 
be simply done by introducing a Jacobian term in the prior, namely 

j(r->0) = n(G,,+^|)-^ (11) 



4. Population Monte Carlo algorithm 

In this section we illustrate a PMC algorithm for obtai ning a sample from 



the joint posterior distribution of 9. PMC methods (see e.g. ICappe et al.l . 120041) 
essentially consist of an iterated version of the importance sampling algorithm: 
at each iteration, a population of particles is generated, independently of each 
other, possibly using a set of different importance functions. Performances ob- 
tained in the past iterations by the different kernels are typically evaluated in 
a relative way in order to adaptivcly modify the proposal distributions over the 
iterations. 



Alternatively, ICeleux et al. 1 (l2006l) suggest the use of the full conditional dis- 



tributions as the importance functions when the model at hand has a latent 
structure representation, as in the present case. This way, one can exploit the 
easiness of proposing from a natural importance function, i.e. the full condi- 
tional, and, at the same time, avoid the convergence issues of a generic MCMC 
method. Also, the coexistence of different particles, and the competition be- 
tween them, allows us to tackle better the issue of multimodality of the posterior 
density. It is well known that in similar situations the Gibbs sampler tends to 
be attracted by one of the modes and hardly escapes from a neighborhood of it 
(ICeleux et al.l . boOOl ) . 



From a model selection perspective, the estimation of the normalising con- 
stant of 7r(0|y) can be performed as a simple by-product of any PMC (and MC) 
sampler. In fact, from the importance sampling identity, one obtains 

P(y)= / / TT{9,z,y)d9dz^ f f ^1^^^2llq{9,z)d9dz, 

where q is the proposal distribution. Adopting the usual Monte Carlo approxi- 
mation, p{y) can be estimated by 
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where the Q^s denote the unnormalised importance weights, and 

N 



^t = -Ec'*'iog(cf' )• (13) 



i=l 



is an entropy measure of performance of the t-th iteration of the algorithm. 
Ht takes high values when the normalised weights of the particles in the t- 
th iteration are concentrated around 1/N. The quantity (1131) is a m onotonic 



transformation of the perplexity measure (jRobert and Casehal |2010| ). defined 



as eyip{Ht)/N. We will use the perplexity index as a measure of non degeneracy 
of the PMC algorithm, which is often considered as one potential drawback of 
Monte Carlo methods. Last but not least, the use of PMC algorithms allows 
the simultaenous draw of all the particles: this fact dramatically improves the 
efficiency of the algorithm compared with generic MCMC approaches. The es- 
timator (jl2p is quite simple and stable since it does not rely upon convergence 
issues. A possible improvement on (fT^ can be obta i ned vi a the Adaptive Multi- 



ple Importance sampling technique, ICornuet et al. I (I2OI2I) . The difference with 



PMC is that, in this case, the importance weights of all simulated values, past 
as well as present, are recomputed at each iteration. We are currently working 
on these aspects. 

Without loss of generality, we illustrate the steps of the algorithm for a bidi- 
mensional setup. The generalization to higher dimensional problems is straight- 
forward, even though numerical problems can arise with some choices of the 
prior for S, and care must be taken in handling the approximation of A{^1). 
In the simulation study described in Section [7] we reported perplexities of our 
samples, although we have never experienced significant problems in terms of 
degeneracy: however we recognize that this can be a critical issue when the 
proposal densities are not well calibrated. 

After approximating A(17), the PMC algorithm can be initialised by sam- 
pling random starting particles. These particles will be updated in the following 
iterations using, as proposal distributions, the full conditionals (when available 
in closed form) or some other distributions which approximate them. 
In particular, the full conditional distributions of the latent variables Z^'s (see 
figure [S]) are symmetric about the origin: 



where 



ip{zi - mi; v) Zi>0 
(p{zi +mi;v) z, <0 



v = {i + xp'G-^'ipy\ 

and rrii is the i-th component of the vector 

m = v [(y - 1„ ® |)'G- V] ■ 

It is not necessary to sample the signs of Zi's, as z is involved in the posterior 
distribution only with its absolute value, and the full conditional distribution 
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Figure 5: Some examples of the full conditional distribution of a generic Zi for different values 
of rrii. 



of \zi\ is just a truncated normal. Nevertheless, we prefer to directly sample 
the value of z, as potential asymmetries in the posterior draws could highlight 
potential problems in the sampler. Hence, we only have to sample the sign 
(each sign having probability 1/2) independently of \zi\. Generation of the 
ZiS can be done using several methods; troubles can arise when m.i/v^/'^ takes 
large negative values: in this case, sampling from the very extreme tail of a 
normal distribution using an accept-reject algorithm can be intensive while the 
inversion method may give numerical ly unreliable results; i n these cases we have 



employed the approach described in iPhilippe and RobertI (|2003[ ). essentially a 
perfect sampling algorithm. 

Simple algebra leads to the full conditional for ^: 

[^|...]^iVp(y-mG/n), 

where y is the sample mean vector, and |z| is the mean of the absolute values 
of the ZiS. Finally, G and ijj have non-standard full-conditional distributions: 

7r(G| • • • ) oc 7r(G)/Wp(n + m, A*), 

where 7r(G) is the prior for G arising from Q and a Jacobian argument, and 

n 

A, = A + ^(y, - Vk^l - i){y^ - V'N.I - 0'- 

i=l 

We use a IWp{n + m, A*) distribution (which is the one obtained by the full 
conditional distribution ignoring the contribution of the prior) to propose values 
for G, as this distribution will resemble the full conditional, in particular for 
large sample sizes. Finally, the full conditional distribution of xp is proportional 
to 

ELll^^:|(y^:-€). G 
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In this case, it is possible to consider several proposal distributions; in order 
to minimize the computational burden, we propose to sample values from the 
p-variate normal "part" of the full conditional distribution. 

As usual, we compute the importance weights C as the ratio Tr{6, z \ y)/q{d, z), 
where tt is the posterior density in which the prior for {if) \ ft) has been (ap- 
proximately) normalised, and q is the joint proposal density. Particles will 
be multinomially rcsampled with unnormalised weights given by C and the re- 
sampled particles will represent the starting point for the particles of the next 
iteration. 



5. Bayes factor 

One of the main reasons for the popularity of the multivariate skew-normal 
model is that it represents a proper generalization of the multivariate normal 
model. Then it is often important to test the normality of the dataset versus 
skew-normal alternatives. Here we will use the Bayes factor to compare the 
multivariate Gaussian model - say Mq - versus the multivariate skew-normal 
one - say Mi. To this end, we need to evaluate the predictive distribution of 
the data under the two competing models. Suppose that, under model Mq, xp 
is set equal to 0. Then the model is described by the following assumptions: 

. yi,...,y,/^'iVp(c,s); 



1 calculation to show that the margir 
lal model is 

2"/2*,((n-l)/2) 



it is a standard calculation to show that the marginal distribution of the data 
under the normal model is 



^p{n~p~l)/4 I g |(«-l)/2 „«/2' 

(14) 



where S is the sample covariance matrix and 

3 = 1 ^ 

is the multivariate Gamma function. Notice that, since the Jeffreys' prior 
7r''(^, S) is improper, quantity ()14p is meaningless per se. However, if we use the 
same - improper - prior for the common parameters of the two models (in this 
case ^ and S), then the Bayes factor is a well-defined tool for model comparison. 
To compute the Bayes factor Bio for comparing the skew-normal model and 
the nested normal model we need an estimate Pi(y) of Pi(y), the marginal 
distribution of data under the skew-normal hypothesis. We then perform T 
iterations of PMC algorithm and we sample TV particles in each iteration. Using 
p^ . the final estimate of the Bayes factor is then 

^ l^t=i^tl. iQ, 
Po{y)Nj:tiHt 
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6. Some discussion and extensions 



The computational approach we have discussed in the previous sections can 
be easily adapted to more general situations. In the presence of k covariates, the 
location parameter vector ^ should be replaced by a A: x p matrix of regression 
coefficients B, so that our model gets transformed into 

y^i^^ SNiG,X'Ji,^jJ), * = l,2,...,n. 
The augmented likelihood for this new model is then 



L(B,G,^|Y,z) « |G|-"/2exp|-iz'z| 

I i=i 



X,B~^P\z,\yG-Hy,-X,B^^P\z,\) 



The previous PMC sampler is still valid for this model; the only necessary 
modification is the introduction of a proposal step for B in lieu of ^. Adopting 
a fiat prior for the elements of the matrix B, we again use the full conditional 
distribution of B as our proposal. It is easy to show that 



where 



and the symb ol MiV(]V[, R, A) refers to a matrix normal random variable V 
( Dawidl . Il98ll ). with location M and scale parameters R and A, with density: 



/(V|M,R,A) 



exp { i tr[R-i(V - M)'A"^(V - M)]} 

(27r)'W2|R|n/2|A|P/2 ' 



Simulating draws from this distribution is simple, as it is linked with the mul- 
tivariate normal distribution by a simple relation: 

V - Af7V„xp(M, R, A) if and only if vec(V) - Ar„p(vec(M), R ® A), 

where (X) denotes the Kronecker product. 

The Bayesian approach through data augmentation is also particularly useful 
in problems with missing data. Our algorithm can be easily adapted to account 
for missingness and a co mparison of our app roac h with the one b ased on the 
EM algorithm proposed in Lin and Lin ( 2011 ) and Lin et al. ( 2009f ) is currently 
under investigation. 
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7. Simulations and examples 

In this section wc consider the frequentist properties of our Bayesian pro- 
cedure; in particular we simulate samples of size ni = 50 and 712 = 200 for 
different combinations of parameter values. In all simulations we have used 
20, 000 particles for 20 iterations, setting ^ = (3, 3)' and wi = cj2 = 1- 
Table [T] shows a summary of the results: for each parameter combination we 
provide 

• the frequency of times that Biq provides evidence in favour of the normal 
or the skew-normal model (second and fourth column) or that it does not 
provide strong evidence for any of the two models (third column); 

• the median of the simulated sampling distribution of the posterior median 
(fifth column); 

• the frequentist coverage of the one-sided 0.95% and 0.9% credible sets 
(columns FC0.95 and FC0.9); 

• the median of the sampling distributions of the posterior mean E(?/'i|y) 
(eighth column, denoted by MeMean), Conditional (on the true z^'s) MLE 
(ninth column, denoted by MeCMLE) and MLE (tenth column, denoted 
by MeMLE); 

Results for the skewness parameter are shown for the first component of the 
vector ipi, similar results are obtained for ip2- might be expected, both 
the likelihood and Bayesian approaches successfully estimate the off-diagonal 
element of G, while estimation oi xjj - and, consequently, of ^ - seems more dif- 
ficult. Table [1] highlights the difficulties in catching skewness in small datasets, 
which implies a high rate of wrong answers given by Bayes factors for non- 
normal samples. With respect to the ML estimator, it should be noticed that, 
even though the medians of the sampling distributions of the MLE are quite 
precise, this estimator always shows a non-negligible probability of producing 
infinite estimates. 

Figure [5] compares the estimates obtained using our approach and the maxi- 
mum likelihood method in the most extreme combination of parameters, with 
p = -0.5, u) = (1,1)' and = (0.495,0.495)'. The value of lies on the 
border of the acceptable region and corresponds to a = (7, 7)'. The first row 
shows a comparison between our Bayesian estimates i/^f*^*^, and i/^f*^^, the 
complete maximum likelihood estimates. These estimates are obtained using 
the true values of the latent variables Zi, thus bringing the problem back to a 
multivariate normal regression model of y over |z|, in which ^ and xp play the 
roles of an intercept and a slope. In fact, for known values of z|, the complete 
likelihood function reduces to 

L{9; y,z) cx \G\-"/' exp | -1 f^(y, - | - tMz,|)'G-i(y, - | - t^z,!) I , 
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n = 30 







True values 






Bio < 0.5 


0.5 < Bio 


< 2 


Bio > 2 


Med{Med{ipi)) 


FC0.95 


FC0.9 


MeMean 


MeCMLE 


MeMLE 




= 02, 


p = 


(« 


= 02) 


0.997 


0.003 




0.000 


0.007 


0.983 


0.971 


0.008 


0.010 


-0.015 




= 0.52, 


p = 


(a^ 


' O.7I2) 


0.988 


0.011 




0.001 


0.031 


0.926 


0.874 


0.026 


0.516 


0.085 




= 0.72, 


p = 


(a^ 


^ 4.952) 


0.877 


0.098 




0.025 


0.362 


0.806 


0.728 


0.308 


0.696 


0.582 




= 02, 


p = 0.5 


(« 


= 02) 


0.996 


0.003 




0.001 


0.011 


0.979 


0.959 


0.007 


0.022 


0.123 


v = 


= 0.52, 


p = 0.5 




i O.4I2) 


0.989 


0.009 




0.002 


0.031 


0.900 


0.852 


0.026 


0.519 


0.146 


xp-- 


= 0.72, 


p = 0.5 




i 0.792) 


0.940 


0.042 




0.018 


0.163 


0.782 


0.660 


0.133 


0.702 


0.385 




0.4952, 


p= -0.5 


(« 


= 72) 


0.958 


0.034 




0.008 


0.161 


0.874 


0.838 


0.116 


0.499 


0.483 


n - 


= 200 
































True values 






Bio < 0.5 


0.5 < Bio 


< 2 


Bio > 2 


Med{A4ed{ipi)) 


FC0.95 


FC0.9 


MeMean 


MeCMLE 


MeMLE 


xp 


= O2, 


p = 


(« 


= 02) 


0.999 


0.001 







-0.002 


0.873 


0.816 


-0.003 


-0.007 


0.045 


xp-- 


= 0.52, 


p = 


(a^ 


^0.7l2) 


0.995 


0.004 




0.001 


0.211 


0.709 


0.634 


0.187 


0.500 


0.267 


xp-- 


= 0.72, 


p = Q 


(a^ 


i 4.952) 


0.006 


0.005 




0.989 


0.573 


0.737 


0.635 


0.569 


0.699 


0.687 


xP 


= 02, 


p = 0.5 


(« 


= 02) 


0.999 


0.001 




0.000 


0.027 


0.820 


0.724 


0.031 


-0.005 


0.071 


xp-- 


= 0.52, 


p = 0.5 


(a« 


i O.4I2) 


0.997 


0.002 




0.001 


0.152 


0.695 


0.601 


0.139 


0.499 


0.183 


xp-- 


= 0.72, 


p = 0.5 


(a« 


i 0.792) 


0.899 


0.065 




0.036 


0.479 


0.701 


0.624 


0.460 


0.699 


0.664 




0.4952, 


p = -0.5 


(« 


= 72) 


0.008 


0.019 




0.973 


0.385 


0.885 


0.804 


0.382 


0.496 


0.495 ^ 



Table 1: Summary of the results obtained in the simulation study; the rows relative to combinations of parameters that generate high skewness are 
highlighted in gray. 





A PMC 



A ML 

¥i 



Figure 6; ML, CML, Baycsian estimates and true values of the parameters. 
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and the conditional (on z's) maximum likelihood (CML) estimates are: 



. CML 



- CML 



E 



- CML— 

y-xj} |z| 



G 



CML 



-^CML ^CML—. ^CML ~CML—, 

\y-^ -ijj |z|)(y-C -iP |z|)'. 



The CML estimator is to be considered as a benchmark, as it uses an additional 
piece of information which is not available for ML and PMC estimators. 
The PMC estimates are concentrated in a single cloud around the true value 
and they are in close agreement with the CML estimates. Very few points fall 
far from the cloud: it is probably a consequence of the multimodality of the 
posterior distribution. 

- ML 

The second row shows the comparison between ML estimates xjji and 

- CML 

ipi ■ The dashed line is the bisector of the first and third quadrant, and 
the solid line represents the true value of the parameter. Scatterplots of the 
ML estimates reveal the odd behaviour of the likelihood, with points in the 
"genuine" part of the distribution showing a higher variability. 
The last row of Figure |6] show the bivariate scatterplots of the Bayesian point 

- PMC ... . - ML 

estimates ip and of the maximum likelihood estimates ip , with lines 

- ML 

indicating the true values of the parameters. The sampling distribution of xp 
is clearly multimodal. It also shows a larger dispersion and a negative skewness 
for both V'f " and ijj^'^. 



Figure 7: Perplexity measures in the first three iterations of the PMC algorithm for a simu- 
lation. It shows a typical pattern, with a decreasing number of outlier particles. 
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1.1. A real dataset 

As a final illustration of the proposed algorithm, we analyse the returns of 
two stocks in the NYSE composite index, namely the "ABM Industries Incor- 
porated" and "The Boeing Company" for the two decades October 1, 1992 to 
October 1, 2012 (240 monthly observations). Data are available at 
http : //finance . yahoo . com/q/cp?s=7o5ENYA+Components. Data show a mod- 
erate degree of skewncss. 

We have performed a PMC sampler with 25 iterations, 30, 000 particles 
each. Figure [8] displays the raw data and the estimated SN2 density obtained 
from our proposed algorithm (left) and from the ML approach (right). Table [2] 
summarizes the marginal posterior distributions of the parameters. 





ii 


6 


P 




U2 




-02 


1% 


0.040 


0.042 


0.310 


0.085 


0.086 


-0.075 


-0.076 


5% 


0.045 


0.049 


0.369 


0.088 


0.090 


-0.072 


-0.076 


50% 


0.059 


0.061 


0.462 


0.096 


0.095 


-0.064 


-0.065 


95% 


0.068 


0.068 


0.526 


0.101 


0.102 


-0.046 


-0.054 


99% 


0.071 


0.072 


0.559 


0.104 


0.104 


-0.042 


-0.048 



Table 2: Marginal posterior quantiles for the parameters of the model. 



Bayesian estimate ML estimate 




-0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 

ABM Industries Incorporated ABM Industries Inoorporated 



Figure 8: Returns of the stocks and contour plot of the density of the estimated SN2 model, 
using the PMC algorithm (left) and the ML approach (right). 

Figure [9] depicts a typical pattern of an MC estimate of the Bayes factor 
throughout the iterations: at the sixth iteration, a huge jump occurs, proba- 
bly due to the discovery of a region of high posterior density. This causes a 
degeneracy of the particles, the production of non reliable estimates and a rise 
in the perplexity index for that iteration. Once that the new region has been 
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explored, the estimates become stable. The high value of the estimated Bayes 
factor in the sixth iteration will not affect the final estimate, as it will be down- 
weighted through the perplexity index. Using formula the final estimate 
of the Bayes factor is Biq — 500.76, showing overwhelming evidence in favour 
of the skew-normal model compared to the normal one. 



— I 1 1 1 1 1 — 

5 10 15 20 25 30 

Iteration 

Figure 9: Pattern of the estimated Bayes factor over iterations. 
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